# Transforming DataFrames
Master the pandas basics. Learn how to inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns.

#### Import libraries

In [None]:
import pandas as pd

#### Import data

In [None]:
homelessness = pd.read_csv('../../data/homelessness.csv', index_col=0)
homelessness

***
<br>

# Introducing DataFrames

# Inspecting a DataFrame
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

- `.head()` returns the first few rows (the “head” of the DataFrame).
- `.info()` shows information on each of the columns, such as the data type and number of missing values.
- `.shape` returns the number of rows and columns of the DataFrame.
- `.describe()` calculates a few summary statistics for each column.
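
As a minimal sketch of these four tools, the cell below builds a small made-up `dogs` DataFrame (hypothetical data, not part of the homelessness dataset) and inspects it. Note that `.shape` is an attribute, so it takes no parentheses, and that `.describe()` summarises only the numeric columns by default.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua"],
    "height_cm": [56, 43, 46, 49, 59, 18],
    "weight_kg": [25.0, 23.0, 22.0, None, 29.0, 2.0],  # one missing value
})

first_rows = dogs.head(3)    # first 3 rows (the default is 5)
n_rows, n_cols = dogs.shape  # (rows, columns) tuple
summary = dogs.describe()    # count, mean, std, min, quartiles, max

dogs.info()                  # prints dtypes and non-null counts per column
print(first_rows)
print(n_rows, n_cols)
print(summary)
```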

`homelessness` is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The `individuals` column is the number of homeless individuals who are not part of a family with children. The `family_members` column is the number of homeless individuals who are part of a family with children. The `state_pop` column is the state's total population.
<br>
<br>
##### Instructions 1/4
- View the head of the `homelessness` DataFrame.

In [None]:
# Get the head of the homelessness data
homelessness.head()

##### Instructions 2/4
View information about the column types and missing values in `homelessness`.

In [None]:
# Get information about homelessness
homelessness.info()

##### Instructions 3/4
Print the number of rows and columns in `homelessness`.

In [None]:
# Get the shape of homelessness
homelessness.shape

##### Instructions 4/4
Print some summary statistics that describe the `homelessness` DataFrame.

In [None]:
# Get a description of homelessness
homelessness.describe()

# Parts of a DataFrame
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

- `.values`: A two-dimensional NumPy array of values.
- `.columns`: An index of columns: the column names.
- `.index`: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas `Index` data type allows for more sophisticated options. (These will be covered later.)
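
The three components can be seen on a tiny example. This is a sketch on made-up data; the `dogs` frame and its row labels are hypothetical. The pandas documentation also offers `.to_numpy()` as the recommended alternative to `.values`; both are shown here only to confirm they agree.

```python
import pandas as pd

# Hypothetical toy data with row names instead of row numbers
dogs = pd.DataFrame(
    {"breed": ["Labrador", "Poodle"], "height_cm": [56, 43]},
    index=["Bella", "Charlie"],
)

values = dogs.values    # 2D NumPy array holding the data
columns = dogs.columns  # Index of column names
index = dogs.index      # Index of row labels

print(values)
print(columns)
print(index)

# .to_numpy() returns the same data as .values
same = (dogs.to_numpy() == values).all()
```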
<br>
<br>
##### Instructions
- View a 2D NumPy array of the values in `homelessness`.
- View the column names of `homelessness`.
- View the index of `homelessness`.


In [None]:
# Get the values of homelessness
homelessness.values

In [None]:
# Get the column index of homelessness
homelessness.columns

In [None]:
# Get the row index of homelessness
homelessness.index

***
<br>

# Sorting and subsetting
In this section, we'll cover the two simplest and possibly most important ways to find interesting parts of your DataFrame.

# Sorting rows
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to `.sort_values()`.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

- one column: `df.sort_values("breed")`
- multiple columns: `df.sort_values(["breed", "weight_kg"])`

By combining `.sort_values()` with `.head()`, you can answer questions in the form, "What are the top cases where…?".
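
A short sketch of these sorting patterns on a made-up `dogs` DataFrame (hypothetical data). The `ascending` argument accepts either a single boolean or a list with one boolean per sort column. `.sort_values()` returns a new DataFrame and leaves the original row order untouched.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "weight_kg": [25, 23, 29, 20],
})

# One column, ascending by default
by_weight = dogs.sort_values("weight_kg")

# Two columns: breed A-Z, then heaviest first within each breed
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

# "What are the top cases where...?": the two heaviest dogs
heaviest = dogs.sort_values("weight_kg", ascending=False).head(2)

print(by_weight)
print(by_breed_weight)
print(heaviest)
```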
<br>
<br>
##### Instructions 1/3
- Sort `homelessness` by the number of homeless individuals, from smallest to largest, and save this as `homelessness_ind`.
- View the head of the sorted DataFrame.

In [None]:
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values('individuals')

# View the top few rows
homelessness_ind.head(10)

##### Instructions 2/3
- Sort `homelessness` by the number of homeless `family_members` in descending order, and save this as `homelessness_fam`.
- View the head of the sorted DataFrame.

In [None]:
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values('family_members', ascending=False)

# View the top few rows
homelessness_fam.head()

##### Instructions 3/3
- Sort `homelessness` first by region (ascending), and then by number of family members (descending). Save this as `homelessness_reg_fam`.
- View the head of the sorted DataFrame.

In [None]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])

# View the top few rows
homelessness_reg_fam.head()

# Subsetting columns
When working with data, you may not need all the variables in your dataset. Square brackets (`[]`) can be used to select only the columns that matter to you, in an order that makes sense to you. To select only `col_a` of a DataFrame `df`, use `df['col_a']`. To select `col_a` and `col_b` of `df`, use `df[['col_a', 'col_b']]`.
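
One detail worth seeing in code: single brackets with one column name return a pandas Series, while double brackets (a list of names) return a DataFrame, even when the list has only one entry. The sketch below uses a made-up `dogs` DataFrame (hypothetical data).

```python
import pandas as pd

# Hypothetical toy data used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie"],
    "breed": ["Labrador", "Poodle"],
    "height_cm": [56, 43],
})

heights = dogs["height_cm"]               # single brackets: Series
heights_df = dogs[["height_cm"]]          # list of one name: DataFrame
name_height = dogs[["name", "height_cm"]] # columns come back in the order listed

print(type(heights).__name__)
print(type(heights_df).__name__)
print(name_height)
```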
<br>
<br>
##### Instructions 1/3
- Create a Series called `individuals` that contains only the `individuals` column of `homelessness`.
- View the head of the result.

In [None]:
# Select the individuals column
individuals = homelessness['individuals']

# View the head of the result
individuals.head()

##### Instructions 2/3
- Create a DataFrame called `state_fam` that contains only the `state` and `family_members` columns of `homelessness`, in that order.
- View the head of the result.

In [None]:
# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]

# View the head of the result
state_fam.head()

##### Instructions 3/3
- Create a DataFrame called `ind_state` that contains the `individuals` and `state` columns of `homelessness`, in that order.

In [None]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]

# View the head of the result
ind_state.head()

# Subsetting rows
A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as ***filtering rows*** or ***selecting rows***.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to produce `True` or `False` for each row, then pass the result inside square brackets:

- `dogs[dogs["height_cm"] > 60]`
- `dogs[dogs["color"] == "tan"]`

You can filter for multiple conditions at once by using the "bitwise and" operator, `&`:
`dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]`
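
The filtering pattern can be broken into two steps, as in this sketch on a made-up `dogs` DataFrame (hypothetical data): first build a boolean Series, then use it to select rows. The parentheses around each condition are required because `&` binds more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "height_cm": [56, 43, 62, 70],
    "color": ["brown", "tan", "tan", "black"],
})

# Step 1: a boolean Series with one True/False per row
is_tall = dogs["height_cm"] > 60

# Step 2: keep only the rows where the mask is True
tall = dogs[is_tall]

# Two conditions combined with &, each wrapped in parentheses
tall_tan = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

print(is_tall)
print(tall)
print(tall_tan)
```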

##### Instructions 1/3
- Filter `homelessness` for cases where the number of `individuals` is greater than ten thousand, assigning to `ind_gt_10k`.
- View the result.

In [None]:
# Filter for rows where individuals are greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# View the result
ind_gt_10k

##### Instructions 2/3
- Filter `homelessness` for cases where the USA Census region is "Mountain", assigning to `mountain_reg`.
- View the result.

In [None]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == 'Mountain']

# View the result
mountain_reg

##### Instructions 3/3
- Filter `homelessness` for cases where the number of `family_members` is less than one thousand and the `region` is "Pacific", assigning to `fam_lt_1k_pac`.
- View the printed result.

In [None]:
# Filter for rows where family_members is less than 1000 and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

# View the result
fam_lt_1k_pac

# Subsetting rows by categorical variables

Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the `.isin()` method, which will allow you to tackle this problem by writing one condition instead of three separate ones.
`colors = ["brown", "black", "tan"]`
`condition = dogs["color"].isin(colors)`
`dogs[condition]`
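
To make the comparison concrete, the sketch below (made-up `dogs` data, hypothetical) filters the same rows twice, once with `|` and once with `.isin()`, and checks that the results match. Prefixing the condition with `~` inverts it, which selects the rows whose value is not in the list.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "color": ["brown", "tan", "white", "black"],
})

colors = ["brown", "black"]

# One condition per category, joined with "or"
with_or = dogs[(dogs["color"] == "brown") | (dogs["color"] == "black")]

# The same filter written as a single .isin() condition
with_isin = dogs[dogs["color"].isin(colors)]

# Inverted condition: rows whose color is NOT in the list
not_in = dogs[~dogs["color"].isin(colors)]

print(with_isin)
print(with_or.equals(with_isin))
print(not_in)
```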
<br>
<br>
##### Instructions 1/2
- Filter `homelessness` for cases where the USA census region is "South Atlantic" or it is "Mid-Atlantic", assigning to `south_mid_atlantic`.
- View the result.

In [None]:
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[(homelessness['region'] == 'South Atlantic') | (homelessness['region'] == 'Mid-Atlantic')]

# View the result
south_mid_atlantic

##### Instructions 2/2
- Filter `homelessness` for cases where the `state` is in the list of Mojave Desert states, `canu`, assigning to `mojave_homelessness`.
- View the result.

In [None]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# View the result
mojave_homelessness

***
<br>

# New Columns

# Adding new columns
You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as ***transforming***, ***mutating***, and ***feature engineering***.
You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
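
Both kinds of derivation, a unit change and a combination of columns, look like this in a minimal sketch on made-up `dogs` data (hypothetical; the `bmi` column is just an illustrative formula). Assigning to a column name that does not exist yet creates the column, and the arithmetic is applied row by row.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie"],
    "height_cm": [56, 43],
    "weight_kg": [25, 23],
})

# Change units: centimetres to metres
dogs["height_m"] = dogs["height_cm"] / 100

# Combine columns: weight divided by height squared
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs)
```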
<br>
<br>
##### Instructions
- Add a new column to `homelessness`, named `total`, containing the sum of the `individuals` and `family_members` columns.
- Add another column to `homelessness`, named `p_individuals`, containing the proportion of homeless people in each state who are individuals.

In [None]:
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']

# See the result
homelessness.head()

# Combo-attack!
You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new `pandas` skills to find out.
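
As a rough sketch of how the four manipulations fit together, the cell below runs the same kind of pipeline on a made-up `toy` DataFrame. The state labels and numbers are invented for illustration and are not taken from the homelessness data. Chaining the steps inside parentheses is one possible style; saving each intermediate result to its own variable works just as well.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration
toy = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "individuals": [500, 3000, 1200, 90],
    "state_pop": [1_000_000, 1_000_000, 500_000, 100_000],
})

# 1. Add a column: individuals per 10,000 residents
toy["indiv_per_10k"] = 10000 * toy["individuals"] / toy["state_pop"]

# 2-4. Subset rows, sort, then select columns in one chain
result = (
    toy[toy["indiv_per_10k"] > 20]
    .sort_values("indiv_per_10k", ascending=False)
    .loc[:, ["state", "indiv_per_10k"]]
)

print(result)
```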

<br>

##### Instructions
1. Add a column to `homelessness`, `indiv_per_10k`, containing the number of homeless individuals per ten thousand people in each state.
2. Subset rows where `indiv_per_10k` is higher than `20`, assigning to `high_homelessness`.
3. Sort `high_homelessness` by descending `indiv_per_10k`, assigning to `high_homelessness_srt`.
4. Select only the `state` and `indiv_per_10k` columns of `high_homelessness_srt` and save as `result`. Look at the result.

In [None]:
# Create indiv_per_10k column as homeless individuals per 10k state_pop
homelessness['indiv_per_10k'] = 10000 * homelessness['individuals'] / homelessness['state_pop']

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values('indiv_per_10k', ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state', 'indiv_per_10k']]

# View the result
result